Skip to content

KhulnaSoft AI Compute Grid - Deploy#1

Open
FortiShield wants to merge 7 commits into
CtxOS:mainfrom
FortiShield:khulnasoft
Open

KhulnaSoft AI Compute Grid - Deploy#1
FortiShield wants to merge 7 commits into
CtxOS:mainfrom
FortiShield:khulnasoft

Conversation

@FortiShield
Copy link
Copy Markdown

@FortiShield FortiShield commented Apr 16, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive AI Compute Grid platform with multi-agent orchestration system
    • Implemented FastAPI control plane for job submission, tracking, and resource allocation
    • Added Ray-based distributed task execution with GPU support
    • Introduced Kubernetes operator and custom resource definitions for AIJob management
    • Added automatic node scaling based on queue depth and adaptive scheduling optimization
    • Implemented governance engine with quota enforcement and policy validation
    • Created interactive dashboard UI with real-time metrics, cost tracking, and job monitoring
  • Infrastructure

    • Added GitHub Actions CI/CD pipeline for automated Docker builds and Kubernetes deployments
    • Provided Terraform configurations for cloud provisioning and Ansible playbooks for GPU node setup
  • Documentation

    • Added project README with architecture overview and quick-start guide

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

Warning

Rate limit exceeded

@xeondesk has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 57 minutes and 46 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 57 minutes and 46 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 074e7934-7372-4508-be2c-d04c48148ae7

📥 Commits

Reviewing files that changed from the base of the PR and between eb24883 and 6fdca2b.

📒 Files selected for processing (3)
  • platform/scheduler/locks.py
  • platform/scheduler/scheduler.py
  • runtime/ray/worker_task.py
📝 Walkthrough

Walkthrough

A comprehensive new system for a production-ready hybrid GPU compute grid is introduced, including deployment automation (GitHub Actions), infrastructure provisioning (Terraform/Ansible/Kubernetes), multi-agent orchestration, distributed job scheduling, cost tracking, real-time monitoring APIs (FastAPI/WebSocket), governance/policy enforcement, and an interactive React dashboard.

Changes

Cohort / File(s) Summary
GitHub & Project Setup
.github/workflows/deploy.yml, .gitignore, README.md
Added GitHub Actions deployment workflow for Docker image builds/pushes and Kubernetes manifests; removed node_modules ignore pattern; introduced comprehensive README documenting the compute grid architecture, deployment steps, and security considerations.
Multi-Agent Orchestration
agents/example_agent.py, agents/types/base_agent.py, agents/types/specialized.py
Added abstract Agent base class and three specialized implementations (PlannerAgent, CriticAgent, ObserverAgent) with decision logging; introduced example client script for job submission and polling.
Infrastructure Provisioning
infra/terraform/cloud/main.tf, infra/ansible/gpu-node.yml
Added Terraform configuration for AWS GPU instances and Ansible playbook for GPU driver/NVIDIA Container Toolkit installation and Docker daemon configuration.
Kubernetes Manifests
infra/k8s/crds/aijob.yaml, infra/k8s/gpu/device-plugin.yaml, infra/k8s/ray/head.yaml, infra/k8s/ray/worker.yaml, infra/k8s/scaling/keda.yaml
Added custom AIJob resource definition, NVIDIA GPU device plugin DaemonSet, Ray head/worker deployments, and KEDA-based autoscaling configuration for Redis queue-driven scaling.
Platform API & Database
platform/api/auth.py, platform/api/db.py, platform/api/main.py, platform/api/monitor.py, platform/api/ws.py, platform/api/requirements.txt
Implemented FastAPI application with API-key authentication, SQLAlchemy ORM models (jobs, nodes, costs, decisions), job submission/status endpoints, WebSocket metrics streaming, monitoring endpoints for queue/cost metrics, and dependency list.
Platform AI & Governance
platform/ai/critic.py, platform/ai/workflow_generator.py, platform/governance/policy_engine.py, platform/governance/quotas.py
Added workflow criticism/approval logic, AI-driven goal-to-job-plan generation with Planner/Critic loops, policy validation against allowed agents/restricted parameters, and Redis-backed user quota tracking with usage accounting.
Platform Intelligence & Memory
platform/intelligence/adaptive_weights.py, platform/memory/vector_store.py
Implemented Redis-backed adaptive scheduler weights that adjust based on job success rates; added in-memory namespace-aware knowledge store for agent findings with TTL and metadata management.
Platform Operator
platform/operator/controller.py
Implemented Kopf-based Kubernetes operator listening for AIJob custom resources, enqueuing jobs to Redis on creation, managing deletion cleanup, and patching status updates.
Scheduler & Job Management
platform/scheduler/scheduler.py, platform/scheduler/lifecycle.py, platform/scheduler/autoscaler.py, platform/scheduler/cost_engine.py, platform/scheduler/heartbeat_monitor.py, platform/scheduler/state.py, platform/scheduler/locks.py, platform/scheduler/resource_manager.py
Added core scheduler loop with distributed locking and priority queue dispatch; job lifecycle transitions with retry logic; cloud autoscaler managing node provisioning; cost tracking for cloud jobs; node heartbeat-based fault recovery; Redis node registry with liveness checking; distributed lock implementation; and resource matching logic with adaptive weight selection.
Runtime DSL & Execution
runtime/dsl/dag_manager.py, runtime/dsl/engine.py, runtime/dsl/ray_adapter.py, runtime/ray/worker_task.py
Added DAG-based workflow dependency resolution, multi-step executor with context propagation and governance-gated step injection, Ray actor-based remote task execution, and GPU-bound distributed compute task with CUDA detection.
Audit & Monitoring
platform/audit/observer.py
Implemented autonomous anomaly observer loop detecting long-running jobs and excess mutation counts with kill action escalation.
Web UI - Configuration
ui/package.json, ui/postcss.config.js, ui/tailwind.config.js, ui/index.html, ui/src/index.css, ui/src/main.tsx
Added Vite/React project configuration with dependencies (React 18, Recharts, Tailwind, Framer Motion), PostCSS/Tailwind tooling, custom theme colors/animations, and React mount entrypoint.
Web UI - Components
ui/src/App.tsx, ui/src/components/GridDashboard.tsx, ui/src/components/GPUHeatmap.tsx, ui/src/components/NodeDrillDown.tsx, ui/src/components/MemoryGraph.tsx, ui/src/components/JobGantt.tsx, ui/src/components/CostTimeline.tsx, ui/src/components/AdaptivePanel.tsx, ui/src/components/DecisionTimeline.tsx, ui/src/components/PolicyGuardStatus.tsx, ui/src/components/UserQuotas.tsx
Implemented comprehensive React dashboard with WebSocket-connected metrics, GPU node registry heatmap, per-node resource drill-down, distributed task memory visualization, job execution timeline, cost burn tracking, adaptive weight monitoring, autonomous decision audit trail, and policy/quota status display.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes


🐰 Hops with glee through cloud and wire,
A grid of GPUs climbs ever higher,
From Kubernetes pods to Redis queues,
The workers race with heterogeneous hues!
Through autoscaling springs we bound and hop,
This compute grid blooms and won't stop! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the primary addition: a comprehensive deployment system for KhulnaSoft AI Compute Grid including infrastructure, platform, runtime, agents, and UI components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 17

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟡 Minor comments (13)
agents/types/base_agent.py-12-12 (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Fix non-optional type hints for parameters defaulting to None.

Parameters at lines 12 and 19 have Dict[str, Any] type hints but default to None, which weakens the type contract. Use Optional[Dict[str, Any]] (or Dict[str, Any] | None in Python 3.10+) to properly indicate they accept None.

Typing fix
-from typing import Dict, Any
+from typing import Dict, Any, Optional
 
 class Agent(abc.ABC):
@@
-    def __init__(self, agent_id: str, role: str, config: Dict[str, Any] = None):
+    def __init__(self, agent_id: str, role: str, config: Optional[Dict[str, Any]] = None):
@@
-    def run(self, input_data: Any, context: Dict[str, Any] = None) -> Dict[str, Any]:
+    def run(self, input_data: Any, context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/types/base_agent.py` at line 12, The __init__ signature in BaseAgent
uses non-optional Dict[str, Any] type hints for parameters that default to None;
update those parameters in BaseAgent.__init__ (e.g., the config parameter and
the other parameter on the same ctor) to Optional[Dict[str, Any]] or the Python
3.10+ union form Dict[str, Any] | None so the type hint matches the default
value of None and static type checkers won't complain.
platform/governance/quotas.py-34-38 (1)

34-38: ⚠️ Potential issue | 🟡 Minor

Quota usage accumulates indefinitely—no reset mechanism.

track_usage increments a Redis key that never resets. Without a TTL or periodic reset (e.g., monthly), users will eventually exhaust their quota permanently.

Would you like me to suggest a mechanism for periodic quota resets (e.g., using Redis key expiry or a scheduled reset job)?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/governance/quotas.py` around lines 34 - 38, track_usage currently
calls r.incrbyfloat(f"{self.key_prefix}{user_id}:usage", actual_hours) with no
expiry so quota accumulates forever; fix by ensuring a reset mechanism: either
set a TTL on the usage key when it's first created (use r.expire or r.set(...,
nx=True, ex=MONTH_SECONDS) after incr in track_usage) or implement a separate
reset method (e.g., reset_monthly_quota(user_id) or reset_all_quotas()) and run
it from a scheduled job/cron to clear or roll over keys; reference the
track_usage method, self.key_prefix and the Redis client r when adding the
expire or creating the scheduled reset.
ui/src/components/GPUHeatmap.tsx-23-26 (1)

23-26: ⚠️ Potential issue | 🟡 Minor

Potential runtime error if node.resources is undefined.

Accessing node.resources.vram_gb will throw if resources is missing or null. Also, the GPU is hardcoded to "NVIDIA A10G" regardless of actual hardware.

🛡️ Defensive access with actual GPU data
       <div className="mt-3 grid grid-cols-2 gap-2 text-[10px]">
-        <div className="text-gray-500">VRAM: <span className="text-gray-300">{node.resources.vram_gb}GB</span></div>
-        <div className="text-gray-500">GPU: <span className="text-gray-300">NVIDIA A10G</span></div>
+        <div className="text-gray-500">VRAM: <span className="text-gray-300">{node.resources?.vram_gb ?? 'N/A'}GB</span></div>
+        <div className="text-gray-500">GPU: <span className="text-gray-300">{node.resources?.gpu ?? 'Unknown'}</span></div>
       </div>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/GPUHeatmap.tsx` around lines 23 - 26, In GPUHeatmap.tsx fix
the unsafe access to node.resources.vram_gb and the hardcoded GPU string: change
the JSX that references node.resources.vram_gb to defensively read the value
(e.g. optional chaining or a fallback like node.resources?.vram_gb ?? '—') and
replace the hardcoded "NVIDIA A10G" with the actual GPU model from the node
(e.g. node.resources?.gpu_model ?? node.gpu?.model ?? 'Unknown GPU') so the
component won't throw when resources is undefined and displays a sensible
fallback.
platform/scheduler/state.py-53-58 (1)

53-58: ⚠️ Potential issue | 🟡 Minor

update_node_status silently fails if node doesn't exist.

If the node was concurrently deleted, hget returns None, and this function does nothing. Callers (like the scheduler after dispatch) may not know the update failed.

🛡️ Consider returning success status or logging
 def update_node_status(node_id, status):
     node_data_str = r.hget("grid_nodes", node_id)
     if node_data_str:
         node_data = json.loads(node_data_str)
         node_data["status"] = status
         r.hset("grid_nodes", node_id, json.dumps(node_data))
+        return True
+    else:
+        logger.warning(f"Cannot update status for missing node: {node_id}")
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/scheduler/state.py` around lines 53 - 58, The update_node_status
function currently does nothing when r.hget("grid_nodes", node_id) is None;
change it to return a boolean success flag and surface a warning when the node
is missing: after calling r.hget("grid_nodes", node_id) check for None and if
missing call a logger (or r.logger) to warn that the node_id was not found and
return False; otherwise proceed with json.loads, update node_data["status"],
write back with r.hset and return True; keep function name update_node_status
and the use of "grid_nodes", r.hget, r.hset, json.loads and json.dumps so
callers can detect failure.
platform/audit/observer.py-27-31 (1)

27-31: ⚠️ Potential issue | 🟡 Minor

Incorrect default for missing created_at makes runtime always 0.

If created_at is missing, the code defaults to time.time(), making runtime = time.time() - time.time() ≈ 0. This silently masks the anomaly detection for jobs with missing timestamps.

🔧 Suggested fix
                 # 1. Runtime Anomaly
-                start_time = job_data.get("created_at", time.time())
+                start_time = job_data.get("created_at")
+                if start_time is None:
+                    logger.warning(f"Job {job_data['id']} missing created_at timestamp")
+                    continue
                 runtime = time.time() - start_time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/audit/observer.py` around lines 27 - 31, The code uses time.time()
as the default for job_data["created_at"], which makes runtime ≈ 0 when the
timestamp is missing and hides anomalies; change the logic in the runtime check
(variables: start_time, job_data, runtime, MAX_RUNTIME_SECONDS, logger.warning)
to detect a missing or invalid created_at (e.g., if "created_at" not in job_data
or falsy), log a distinct warning about the missing timestamp, and then set
start_time to 0 (or another sentinel) so runtime = now - 0 yields a large value
that will trigger the existing anomaly handling when appropriate.
ui/src/index.css-1-3 (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Resolve Stylelint compatibility for Tailwind directives in this file.

Static analysis is currently flagging @tailwind and text-rendering value casing. Please align Stylelint config/rules (or local suppressions) so this stylesheet passes lint consistently.

Also applies to: 15-15

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/index.css` around lines 1 - 3, Stylelint is tripping on Tailwind
at-rules and the casing of the text-rendering value; update the Stylelint config
so `@tailwind/`@apply/@variants/@responsive/@screen are allowed by adding them to
the at-rule-no-unknown ignoreAtRules list (or install
stylelint-config-tailwindcss), and relax the value-casing rule for
text-rendering by either disabling
value-keyword-case/declaration-property-value-case for that property or adding
an exception (e.g., allow "optimizeLegibility") so index.css's `@tailwind`
directives and text-rendering values pass linting.
ui/src/components/UserQuotas.tsx-14-15 (1)

14-15: ⚠️ Potential issue | 🟡 Minor

Clamp progress width to avoid overflow artifacts.

Line 14–15 can exceed 100% (or go negative), which can break bar rendering at Line 36 and Line 49.

Suggested fix
-  const gpuProgress = (MOCK_QUOTAS.used / MOCK_QUOTAS.limit) * 100;
-  const costProgress = (MOCK_QUOTAS.cost_used / MOCK_QUOTAS.cost_limit) * 100;
+  const clamp = (v: number) => Math.max(0, Math.min(100, v));
+  const gpuProgress = clamp((MOCK_QUOTAS.used / MOCK_QUOTAS.limit) * 100);
+  const costProgress = clamp((MOCK_QUOTAS.cost_used / MOCK_QUOTAS.cost_limit) * 100);

Also applies to: 36-36, 49-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/UserQuotas.tsx` around lines 14 - 15, gpuProgress and
costProgress can go below 0 or above 100 and break the progress bar rendering;
clamp both computed values (gpuProgress and costProgress) to the 0–100 range
before they are used (e.g., wrap the existing expressions with
Math.min(Math.max(..., 0), 100)) and ensure any places that compute or pass
progress (the calculations using MOCK_QUOTAS.used/limit and
MOCK_QUOTAS.cost_used/cost_limit and where they are rendered) use the clamped
values to prevent overflow/negative widths.
ui/src/components/JobGantt.tsx-42-44 (1)

42-44: ⚠️ Potential issue | 🟡 Minor

Tooltip status text is incorrect for active jobs.

Line 43 always shows Result: Success, including when job.status is running. This is misleading state reporting in the dashboard.

Suggested fix
-                ID: {job.id} | Result: Success
+                ID: {job.id} | Status: {job.status}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/JobGantt.tsx` around lines 42 - 44, Tooltip currently
hardcodes "Result: Success" in JobGantt (the div that shows ID: {job.id}),
causing incorrect display for running jobs; change the tooltip to derive its
text from job.status (and job.result if available): when job.status ===
'running' show "Status: Running", when job.status === 'completed' show "Result:
Success" (or use job.result if provided), and when job.status === 'failed' show
"Result: Failed"; update the JSX in JobGantt to conditionally render this text
using job.status/job.result instead of the fixed "Result: Success".
platform/scheduler/cost_engine.py-17-19 (1)

17-19: ⚠️ Potential issue | 🟡 Minor

Clamp negative durations before billing.

Line 17–19 can compute negative cost if start_time is in the future (clock skew/bad input), which could reduce total spend incorrectly.

🛡️ Small guard
-    duration_seconds = time.time() - start_time
+    duration_seconds = max(0.0, time.time() - start_time)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/scheduler/cost_engine.py` around lines 17 - 19, The billing code in
cost_engine.py computes duration_seconds, duration_hours and cost from
start_time and can yield negative values if start_time is in the future; clamp
the computed duration_seconds to zero before converting to hours and multiplying
by PRICE_PER_HOUR (e.g., set duration_seconds = max(0, time.time() -
start_time)) so that duration_hours and cost are never negative while preserving
the existing variable names duration_seconds, duration_hours, cost and constant
PRICE_PER_HOUR.
ui/src/components/GridDashboard.tsx-53-57 (1)

53-57: ⚠️ Potential issue | 🟡 Minor

selectedNode may become stale when nodes are replaced via WebSocket.

selectedNode is initialized with nodes[0] from the hardcoded initial state. When setNodes(data.nodes) replaces the array with server data, selectedNode still references the old object. Consider updating selectedNode when nodes change to keep it in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/GridDashboard.tsx` around lines 53 - 57, selectedNode is
initialized from the initial nodes array and can become stale when setNodes
replaces the nodes array (e.g., from WebSocket updates); update selectedNode
whenever nodes change by adding effect logic: watch nodes (the state from
useState) and call setSelectedNode to a matching node from the new array (by id)
or fallback to nodes[0] or undefined as appropriate. Locate the useState
declarations for nodes, setNodes, selectedNode, setSelectedNode in
GridDashboard.tsx and implement a useEffect that syncs selectedNode with the
updated nodes array to avoid referencing the old object.
platform/governance/policy_engine.py-43-47 (1)

43-47: ⚠️ Potential issue | 🟡 Minor

String containment check for restricted params is easily bypassed.

Simple substring matching (line 46) can be bypassed with whitespace variations (rm -rf /), URL encoding, or command splitting. Consider using a more robust pattern-matching approach or command parsing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/governance/policy_engine.py` around lines 43 - 47, The current
substring check using params_str = str(step.get("params", "")) against
RESTRICTED_PARAMS can be bypassed by whitespace, URL-encoding, or tokenization
tricks; update the check in the policy_engine to normalize and parse params
before matching: decode URL-encoding, collapse repeated whitespace, then
tokenize using a shell-aware parser (e.g., shlex) or apply robust regex patterns
with word boundaries/anchors to match whole commands/tokens rather than
substrings, and replace the simple containment test that appends to issues
(issues.append(...)) with this normalized/tokenized matching against
RESTRICTED_PARAMS.
platform/api/ws.py-24-25 (1)

24-25: ⚠️ Potential issue | 🟡 Minor

disconnect() may raise ValueError if websocket not in list.

If disconnect() is called twice for the same websocket (e.g., from exception handler after WebSocketDisconnect), list.remove() will raise ValueError. Use a safe removal pattern.

Proposed fix
     def disconnect(self, websocket: WebSocket):
-        self.active_connections.remove(websocket)
+        try:
+            self.active_connections.remove(websocket)
+        except ValueError:
+            pass  # Already removed
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/ws.py` around lines 24 - 25, The disconnect method currently
calls self.active_connections.remove(websocket) which will raise ValueError if
the websocket isn't present; update the WebSocketSessionManager.disconnect (or
disconnect) implementation to remove the socket safely—either check "if
websocket in self.active_connections: self.active_connections.remove(websocket)"
or wrap remove in try/except ValueError and ignore missing entries—so repeated
calls (e.g., after WebSocketDisconnect) won't raise.
ui/src/components/GridDashboard.tsx-61-89 (1)

61-89: ⚠️ Potential issue | 🟡 Minor

WebSocket lacks error handling and reconnection logic.

The WebSocket connection has no onerror handler and no reconnection strategy. If the connection fails or drops, the dashboard will silently stop receiving updates without user feedback.

Proposed fix with error handling and reconnection
   useEffect(() => {
+    let reconnectTimeout: NodeJS.Timeout | null = null;
+    
+    const connect = () => {
       const wsUrl = `ws://${window.location.hostname}:8000/ws`;
       ws.current = new WebSocket(wsUrl);
 
       ws.current.onmessage = (event) => {
         // ... existing handler
       };
+      
+      ws.current.onerror = (error) => {
+        console.error('WebSocket error:', error);
+      };
+      
+      ws.current.onclose = () => {
+        console.log('WebSocket closed, reconnecting in 3s...');
+        reconnectTimeout = setTimeout(connect, 3000);
+      };
+    };
+    
+    connect();
 
     return () => {
+      if (reconnectTimeout) clearTimeout(reconnectTimeout);
       ws.current?.close();
     };
   }, []);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/GridDashboard.tsx` around lines 61 - 89, The WebSocket in
the useEffect currently only sets onmessage and closes on unmount, so add robust
error handling and reconnection: attach ws.current.onerror to log/process errors
and set a connection state (e.g., connectionStatus via setConnectionStatus),
implement ws.current.onclose to trigger a reconnect loop using exponential
backoff with a max retry cap, recreate ws.current and rebind
onmessage/onerror/onclose when reconnecting, ensure setMetrics, setNodes, and
setHistoricalData continue to be used after reconnect, and make sure to cancel
pending reconnect timers and close the socket in the cleanup to avoid leaks.
🧹 Nitpick comments (18)
agents/types/specialized.py (1)

8-8: Add Optional typing to parameters with None defaults.

Parameters at lines 8, 11, 25, 28, 37, and 40 have None defaults but non-optional type annotations (Dict[str, Any]). Import Optional from typing and update these annotations to Optional[Dict[str, Any]] for consistency with Python typing conventions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/types/specialized.py` at line 8, Update type annotations for
parameters that have None defaults to use Optional[Dict[str, Any]]: import
Optional from typing and change the annotations for the affected parameters
(e.g., the __init__ method signature shown and the other functions/methods
referenced at lines 11, 25, 28, 37, and 40) from Dict[str, Any] to
Optional[Dict[str, Any]] so the type matches the default None; ensure you adjust
any corresponding type hints in those function/method definitions (keep
parameter names and defaults unchanged).
platform/audit/observer.py (2)

22-22: Style: Multiple statements on one line.

Per static analysis (E701), refactor to separate lines for readability.

♻️ Suggested fix
                 raw_data = redis_conn.get(key)
-                if not raw_data: continue
+                if not raw_data:
+                    continue
                 
                 job_data = json.loads(raw_data)
-                if job_data.get("status") != "running": continue
+                if job_data.get("status") != "running":
+                    continue

Also applies to: 25-25

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/audit/observer.py` at line 22, The single-line statement "if not
raw_data: continue" (and the similar instance at the other occurrence) violates
E701; split it into two lines by putting the conditional and the continue on
separate lines so that the check using raw_data is on one line and the continue
is on the next, preserving the surrounding loop context (where the raw_data
check occurs) and keeping the same logic in observer.py.

4-4: Unused import: redis.

The module imports redis but uses redis_conn from platform.scheduler.state instead.

♻️ Remove unused import
 import time
 import json
 import logging
-import redis
 from platform.scheduler.state import r as redis_conn
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/audit/observer.py` at line 4, The file imports the unused symbol
redis at the top of observer.py while the module uses the existing redis_conn
from platform.scheduler.state; remove the unused import statement "import redis"
to clean up the module and avoid unused-import linter errors, leaving any
references to redis_conn intact.
platform/ai/critic.py (1)

5-7: Avoid logging.basicConfig at module level.

Calling logging.basicConfig(level=logging.INFO) at import time can override the application's logging configuration if this module is imported before the main app configures logging. This is a common gotcha in library/module code.

♻️ Suggested fix
-# Setup logging
-logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger("AICritic")

Let the application's entry point configure basicConfig instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/ai/critic.py` around lines 5 - 7, Remove the top-level call to
logging.basicConfig(level=logging.INFO) so this module does not configure global
logging on import; keep the logger = logging.getLogger("AICritic") line and
optionally add a short comment near logging.basicConfig or above logger noting
that logging should be configured by the application entry point (do not call
basicConfig inside platform/ai/critic.py).
ui/src/components/GPUHeatmap.tsx (1)

3-3: Add TypeScript types for props to improve type safety.

Using any for props sacrifices TypeScript's compile-time checks. Consider defining proper interfaces:

♻️ Suggested type definitions
interface NodeResources {
  vram_gb: number;
  gpu?: string;
}

interface Node {
  id: string;
  type: 'cloud' | 'local';
  status: 'busy' | 'ready' | 'idle';
  resources: NodeResources;
}

const NodeChip = ({ node }: { node: Node }) => {
  // ...
};

export default function GPUHeatmap({ nodes }: { nodes: Node[] }) {
  // ...
}

Also applies to: 31-31

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/GPUHeatmap.tsx` at line 3, Replace the use of `any` for
component props by declaring proper TypeScript interfaces and using them in the
component signatures: define interfaces like `NodeResources` and `Node` (with
fields vram_gb:number, optional gpu:string, id:string, type:'cloud'|'local',
status:'busy'|'ready'|'idle', resources:NodeResources) and change `const
NodeChip = ({ node }: any)` to `const NodeChip = ({ node }: { node: Node })`,
and likewise update `GPUHeatmap` to accept `nodes: Node[]`; export or colocate
the interfaces so both `NodeChip` and `GPUHeatmap` can reuse them.
platform/memory/vector_store.py (1)

24-24: Use collision-safe IDs for findings.

The current ID can collide when the same agent writes multiple findings within the same second.

♻️ Proposed fix
 import logging
 import json
 import time
+import uuid
@@
-            "id": f"{agent_id}_{int(time.time())}",
+            "id": f"{agent_id}_{uuid.uuid4()}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/memory/vector_store.py` at line 24, Replace the current
collision-prone ID generation ("id": f"{agent_id}_{int(time.time())}") with a
collision-safe scheme: generate a UUID (e.g., uuid.uuid4()) and combine it with
agent_id (e.g., f"{agent_id}_{uuid4.hex}") or use time.time_ns() if you prefer
monotonic nanosecond timestamps; update the ID assignment where the dict is
built and add the necessary import for uuid (or use time.time_ns()) so multiple
findings from the same agent in the same second cannot collide.
platform/scheduler/heartbeat_monitor.py (1)

24-33: Isolate per-job decode/read failures instead of aborting the whole pass.

A single bad/missing record currently skips recovery for all other jobs in that cycle.

🛡️ Proposed fix
-            for key in all_job_keys:
-                job_data = json.loads(redis_conn.get(key))
+            for key in redis_conn.scan_iter(match="job:*", count=500):
+                try:
+                    raw = redis_conn.get(key)
+                    if not raw:
+                        continue
+                    job_data = json.loads(raw)
+                except (TypeError, json.JSONDecodeError):
+                    logger.warning(f"Skipping unreadable job record: {key}")
+                    continue
                 if job_data.get("status") == "running":
@@
-        except Exception as e:
-            logger.error(f"Error in recovery loop: {str(e)}")
+        except Exception:
+            logger.exception("Error in recovery loop")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/scheduler/heartbeat_monitor.py` around lines 24 - 33, The recovery
loop should not abort on a single bad job record; wrap the per-key read/decode
and subsequent checks in a per-job try/except so one broken entry doesn't stop
the whole pass: when iterating keys, guard the redis_conn.get(key) +
json.loads(...) + the status/node_id checks and fail_job call in a try block,
log any exception (including the key and error) via logger.error, skip that
entry and continue to the next key; also defensively handle redis_conn.get(key)
returning None before calling json.loads to avoid a TypeError. Use the existing
names (redis_conn.get, json.loads, job_data, active_node_ids, fail_job, logger)
to locate and modify the code.
ui/src/components/CostTimeline.tsx (1)

14-14: Replace any[] with a concrete point type.

Line 14 loses static guarantees for chart input shape; a typed CostPoint[] avoids silent runtime mismatches.

Suggested fix
- export default function CostTimeline({ data = MOCK_COST_DATA }: { data?: any[] }) {
+ type CostPoint = { time: string; spend: number };
+ export default function CostTimeline({ data = MOCK_COST_DATA }: { data?: CostPoint[] }) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/CostTimeline.tsx` at line 14, The prop type for
CostTimeline currently uses any[] which loses type safety; change the component
signature to use a concrete CostPoint[] (e.g., CostTimeline({ data =
MOCK_COST_DATA }: { data?: CostPoint[] })) and add or import a CostPoint
definition matching the chart shape (fields like timestamp/date, cost, category,
etc.) — update MOCK_COST_DATA to be typed as CostPoint[] and ensure any
functions inside CostTimeline (parsers, mappers) use CostPoint to preserve
static guarantees.
ui/src/components/UserQuotas.tsx (1)

58-59: Use quota source-of-truth for RPM text.

Line 58 hardcodes 60 requests/min instead of reading MOCK_QUOTAS.rpm_limit, which risks UI/data drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/UserQuotas.tsx` around lines 58 - 59, The UI hardcodes "60
requests/min" in the UserQuotas component instead of using the canonical quota
value; update UserQuotas.tsx to read the RPM display from the source-of-truth
(MOCK_QUOTAS.rpm_limit) and render that value (e.g., `${MOCK_QUOTAS.rpm_limit}
requests/min`) so the text stays in sync with the data model; reference the
UserQuotas component and the MOCK_QUOTAS.rpm_limit symbol and ensure any
necessary import or prop access is used to obtain the value before rendering.
ui/src/components/MemoryGraph.tsx (1)

13-17: Replace COLORS: any with a typed map to restore compile-time safety.

any on line 13 disables type checking for COLORS[entry.type] on line 49. While the current data in MOCK_MEMORY only uses valid type values ('exploit', 'recon', 'intel'), using a typed map prevents invalid keys from being accessed without detection.

♻️ Suggested typing change
-const COLORS: any = {
+type MemoryType = 'exploit' | 'recon' | 'intel';
+const COLORS: Record<MemoryType, string> = {
   exploit: '#f43f5e', // rose-500
   recon: '#3b82f6',   // blue-500
   intel: '#8b5cf6',   // violet-500
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/components/MemoryGraph.tsx` around lines 13 - 17, Replace the untyped
COLORS constant with a properly typed map so accesses like COLORS[entry.type]
are checked at compile time; update COLORS to use a Record or mapped type keyed
by the allowed entry types (e.g., 'exploit'|'recon'|'intel') or reference the
existing union/enum for memory entry types, keeping the same color string values
so code using COLORS (e.g., COLORS[entry.type] in MemoryGraph) benefits from
static type safety.
platform/api/monitor.py (2)

12-12: Move import to top of file.

PEP 8 recommends placing all imports at the top of the module. This import is placed after module-level variable assignments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/monitor.py` at line 12, The import of get_active_nodes from
platform.scheduler.state is placed after module-level variable assignments; move
this import statement up into the main import block at the top of the module so
all imports (including "from platform.scheduler.state import get_active_nodes")
appear before any module-level variables or code execution to comply with PEP8.

31-32: Exception handling swallows HTTP status context.

Returning a dict with status: "error" on any exception results in a 200 OK HTTP response, making it harder for clients to detect failures. Consider raising an HTTPException to return appropriate status codes.

Proposed fix
+from fastapi import APIRouter, Depends, HTTPException
 ...
     except Exception as e:
-        return {"status": "error", "message": str(e)}
+        raise HTTPException(status_code=500, detail=str(e))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/monitor.py` around lines 31 - 32, The except block that
currently does "except Exception as e: return {'status': 'error', 'message':
str(e)}" swallows HTTP status context and returns 200; change it to raise a
FastAPI HTTPException instead (import HTTPException and appropriate status from
fastapi/status) inside the same except Exception as e handler so the endpoint
returns a non-200 status (e.g., status.HTTP_500_INTERNAL_SERVER_ERROR) and
include str(e) as the detail; update the handler surrounding the function (the
endpoint function in platform/api/monitor.py) to raise HTTPException rather than
returning the dict.
runtime/dsl/engine.py (3)

3-3: Unused import: json.

The json module is imported but never used in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/dsl/engine.py` at line 3, The import statement "import json" in
runtime/dsl/engine.py is unused and should be removed; delete the unused import
line (the "import json" symbol) to clean up the module and avoid linter
warnings.

70-77: run_step is mocked and doesn't integrate with Ray adapter.

The run_step method simulates execution with time.sleep(1) instead of calling run_workflow_step_on_ray from runtime/dsl/ray_adapter.py. The docstring acknowledges this is a mock, but consider adding a TODO or feature flag to enable Ray integration.

Proposed integration with Ray adapter
+from runtime.dsl.ray_adapter import run_workflow_step_on_ray
+
+USE_RAY = False  # Feature flag for Ray execution
+
     def run_step(self, name, agent, params):
         """
-        Mock execution of a step. 
-        In production, this calls Ray tasks.
+        Executes a workflow step, optionally on Ray.
         """
-        # Simulate work
-        time.sleep(1)
-        return {"last_action": f"finished {name}", "data": f"processed_{name}"}
+        if USE_RAY:
+            return run_workflow_step_on_ray(agent, name, params)
+        else:
+            # Mock execution for testing
+            time.sleep(1)
+            return {"last_action": f"finished {name}", "data": f"processed_{name}"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/dsl/engine.py` around lines 70 - 77, The run_step method currently
mocks execution with time.sleep(1) and returns fake results; replace or
conditionally delegate to the Ray integration by calling the
run_workflow_step_on_ray adapter function from runtime/dsl/ray_adapter.py (or
add a feature-flag/ENV check) so real Ray tasks are executed when enabled:
update run_step to import and call run_workflow_step_on_ray(agent, name, params)
when the Ray flag is set (fall back to the existing mock for local/testing), and
add a clear TODO comment and unit test path for both code paths; reference the
run_step method and run_workflow_step_on_ray symbol to locate where to change
behavior.

13-17: StepStatus enum is defined but never used.

The StepStatus enum is declared but the workflow engine uses string literals for status tracking instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/dsl/engine.py` around lines 13 - 17, The StepStatus enum is declared
but unused; update the workflow engine to use StepStatus members instead of raw
string literals for status tracking (e.g., replace occurrences of
"pending"/"running"/"completed"/"failed" with
StepStatus.PENDING/StepStatus.RUNNING/StepStatus.COMPLETED/StepStatus.FAILED or
their .value where a string is required). Search for status reads/writes in
functions like any step execution, status update, and serialization logic and
switch them to use the StepStatus enum (or call .value when
persisting/serializing) so the enum is actually used consistently across the
engine.
platform/governance/policy_engine.py (1)

11-11: MAX_JOBS_PER_SESSION is defined but never used.

This constant is declared but not enforced anywhere in the policy engine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/governance/policy_engine.py` at line 11, MAX_JOBS_PER_SESSION is
defined but never enforced; update the policy engine logic that creates or
enqueues jobs (e.g., functions/methods like add_job, schedule_job, enqueue_job,
or PolicyEngine.process_jobs) to check the current session job count against
MAX_JOBS_PER_SESSION and reject or defer new jobs when the limit is reached.
Implement the check to either raise a clear exception or return an error
status/log (use existing process_logger or error handling conventions), and add
unit tests to cover hitting the limit to ensure the constant is actually
applied.
platform/api/ws.py (1)

53-53: Use asyncio.get_running_loop() instead of deprecated pattern.

asyncio.get_event_loop().time() is deprecated in Python 3.10+. Use asyncio.get_running_loop().time() in async contexts.

Proposed fix
-                "timestamp": asyncio.get_event_loop().time()
+                "timestamp": asyncio.get_running_loop().time()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/ws.py` at line 53, Replace the deprecated
asyncio.get_event_loop().time() call used to populate the "timestamp" value with
asyncio.get_running_loop().time() in the async WebSocket message construction
(the dict key "timestamp" in platform.api.ws); update the expression where
"timestamp" is set so it calls get_running_loop().time() to avoid the deprecated
pattern in async contexts.
runtime/dsl/ray_adapter.py (1)

25-36: Transient actor per step is inefficient; actors are never cleaned up.

Creating a new AgentExecutor actor for each workflow step (line 31) incurs significant overhead in Ray. Actors are designed to be reused across multiple calls. Additionally, the actor is never explicitly terminated, which may cause resource leaks in long-running workflows.

Consider using an actor pool or reusing actors across steps with the same agent_type.

Proposed fix with actor cleanup
 def run_workflow_step_on_ray(agent_type, step_name, params):
     """
     Helper to dispatch a DSL step to a Ray Actor.
     """
+    executor = None
     try:
         # Create a transient actor for the step
         executor = AgentExecutor.remote(agent_type, params)
         future = executor.execute_task.remote(step_name, params)
         return ray.get(future)
     except Exception as e:
-        logger.error(f"Ray execution error: {str(e)}")
+        logger.error("Ray execution error: %s", e)
         raise
+    finally:
+        if executor is not None:
+            ray.kill(executor)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runtime/dsl/ray_adapter.py` around lines 25 - 36, The function
run_workflow_step_on_ray creates a new AgentExecutor actor for every call and
never cleans it up; instead implement a simple actor cache/registry keyed by
agent_type so AgentExecutor actors are created once and reused (lookup-or-create
an actor for agent_type, call executor.execute_task.remote(step_name, params),
and ray.get the future), and add a module-level shutdown/teardown function that
iterates the cached actors and calls ray.kill (or the appropriate Ray
termination API) to explicitly terminate them; reference AgentExecutor,
run_workflow_step_on_ray, execute_task and the actor cache/registry when making
the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 997bba79-d76f-4657-9204-36fb09e66232

📥 Commits

Reviewing files that changed from the base of the PR and between 022eaf0 and eb24883.

📒 Files selected for processing (56)
  • .github/workflows/deploy.yml
  • .gitignore
  • README.md
  • agents/example_agent.py
  • agents/types/base_agent.py
  • agents/types/specialized.py
  • infra/ansible/gpu-node.yml
  • infra/k8s/crds/aijob.yaml
  • infra/k8s/gpu/device-plugin.yaml
  • infra/k8s/ray/head.yaml
  • infra/k8s/ray/worker.yaml
  • infra/k8s/scaling/keda.yaml
  • infra/terraform/cloud/main.tf
  • platform/ai/critic.py
  • platform/ai/workflow_generator.py
  • platform/api/auth.py
  • platform/api/db.py
  • platform/api/main.py
  • platform/api/monitor.py
  • platform/api/requirements.txt
  • platform/api/ws.py
  • platform/audit/observer.py
  • platform/governance/policy_engine.py
  • platform/governance/quotas.py
  • platform/intelligence/adaptive_weights.py
  • platform/memory/vector_store.py
  • platform/operator/controller.py
  • platform/scheduler/autoscaler.py
  • platform/scheduler/cost_engine.py
  • platform/scheduler/heartbeat_monitor.py
  • platform/scheduler/lifecycle.py
  • platform/scheduler/locks.py
  • platform/scheduler/resource_manager.py
  • platform/scheduler/scheduler.py
  • platform/scheduler/state.py
  • runtime/dsl/dag_manager.py
  • runtime/dsl/engine.py
  • runtime/dsl/ray_adapter.py
  • runtime/ray/worker_task.py
  • ui/index.html
  • ui/package.json
  • ui/postcss.config.js
  • ui/src/App.tsx
  • ui/src/components/AdaptivePanel.tsx
  • ui/src/components/CostTimeline.tsx
  • ui/src/components/DecisionTimeline.tsx
  • ui/src/components/GPUHeatmap.tsx
  • ui/src/components/GridDashboard.tsx
  • ui/src/components/JobGantt.tsx
  • ui/src/components/MemoryGraph.tsx
  • ui/src/components/NodeDrillDown.tsx
  • ui/src/components/PolicyGuardStatus.tsx
  • ui/src/components/UserQuotas.tsx
  • ui/src/index.css
  • ui/src/main.tsx
  • ui/tailwind.config.js
💤 Files with no reviewable changes (1)
  • .gitignore

build-and-push:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat .github/workflows/deploy.yml

Repository: CtxOS/ctxpod

Length of output: 1372


🌐 Web query:

actions/checkout v4 latest version 2024 2025

💡 Result:

The latest version of actions/checkout v4 is v4.3.1, released on 2025-11-17. This ports cleanup changes from v6 to the v4 branch. Prior versions include v4.3.0 (2025-08-11, documentation and dependency updates), v4.2.2 (2024-10), v4.2.0 (2024-09-25), and v4.0.0 (2023-10-17, updated to node20 runtime). Use uses: actions/checkout@v4 to pin to the v4 branch, which always points to the latest v4 release. Note that newer major versions like v6.0.2 (2026-01-09) exist with Node.js 24 support.

Citations:


🌐 Web query:

docker/setup-buildx-action latest version 2024 2025

💡 Result:

The latest version of the docker/setup-buildx-action GitHub Action is v4.0.0, released on 2026-03-05. This is a major release featuring Node 24 as the default runtime (requires Actions Runner v2.327.1+), removal of deprecated inputs/outputs, switch to ESM, and various dependency updates including @actions/core to 3.0.0 and @docker/actions-toolkit to 0.79.0. Previous recent versions: - v3.12.0 (2025-12-19): Deprecated install input, dependency bumps. - v3.11.1 and earlier in 2025. Usage example from official docs: - name: Set up Docker Buildx uses: docker/setup-buildx-action@v4 Repository last pushed on 2026-04-09 (as of search data), confirming v4.0.0 remains latest. Official Docker docs and GitHub Marketplace also reference v4.

Citations:


🌐 Web query:

docker/login-action latest version 2024 2025

💡 Result:

The latest version of the docker/login-action GitHub Action is v4.1.0, released on April 2, 2026. This is the current latest as of 2026-04-16. Recent releases in 2024-2026: - v4.1.0 (2026-04-02): Fix scoped Docker Hub cleanup path, dependency bumps including @aws-sdk/client-ecr to 3.1020.0 and @docker/actions-toolkit to 0.86.0. - v4.0.0 (2026-03-04): Node 24 as default runtime, switch to ESM, major dependency updates like @actions/core to 3.0.0. Usage example from official docs: - name: Login to Docker Hub uses: docker/login-action@v4.1.0 with: username: ${{ vars.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_TOKEN }} Official repository: https://github.com/docker/login-action

Citations:


🌐 Web query:

docker/build-push-action latest version 2024 2025

💡 Result:

The latest version of the docker/build-push-action GitHub Action is v7.1.0, released on 2026-04-10. This is confirmed by the GitHub releases page listing it as the top release dated 10 Apr (which aligns with 2026-04-10 given today's date of 2026-04-16), and the tags page showing v7.1.0 as the first tag. Previous versions include v7.0.0 (2026-03-05) and v6.19.x in early 2026. In 2024, versions were in the v6.x series (e.g., v6.4.0 in July 2024), and in 2025 leading up to v6.19.0 in February 2026. Official examples in the README and Marketplace use v7.

Citations:


Update GitHub Actions to current major versions.

These pinned action versions are significantly outdated. Update to the latest available versions:

  • actions/checkout@v3 → use v4 or latest v6
  • docker/setup-buildx-action@v2 → use v3 or latest v4
  • docker/login-action@v2 → use latest v3 or v4
  • docker/build-push-action@v4 → use latest v6 or v7

The major versions in the workflow file are no longer current and may face compatibility issues with modern GitHub runners.

🧰 Tools
🪛 actionlint (1.7.12)

[error] 17-17: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy.yml at line 17, Update the pinned GitHub Action
versions to current majors: replace uses: actions/checkout@v3 with a current
major (e.g., actions/checkout@v4 or `@v6`), update docker/setup-buildx-action@v2
to v3 or v4, docker/login-action@v2 to v3 or v4, and docker/build-push-action@v4
to v6 or v7; make these replacements where those action usages
(actions/checkout, docker/setup-buildx-action, docker/login-action,
docker/build-push-action) appear in the workflow so the workflow uses
maintained, compatible major versions.

Comment on lines +6 to +7
ami = "ami-gpu-ubuntu-22" # placeholder for actual GPU AMI
instance_type = "g5.xlarge" # NVIDIA A10G GPU instance
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify AMI literal shape in Terraform files (read-only).
rg -nP '^\s*ami\s*=\s*"([^"]+)"' infra/terraform/cloud/main.tf
python - <<'PY'
import re, pathlib
p = pathlib.Path("infra/terraform/cloud/main.tf")
text = p.read_text()
m = re.search(r'ami\s*=\s*"([^"]+)"', text)
if not m:
    print("No AMI assignment found")
else:
    ami = m.group(1)
    print("Found:", ami)
    print("Valid EC2 AMI pattern:", bool(re.fullmatch(r"ami-[0-9a-fA-F]{8,17}", ami)))
PY

Repository: CtxOS/ctxpod

Length of output: 182


Use a real AMI ID instead of the placeholder.

The current value ami-gpu-ubuntu-22 does not match the valid EC2 AMI ID format (pattern: ami-[0-9a-fA-F]{8,17}) and will cause Terraform provisioning to fail.

Suggested fix
 resource "aws_instance" "gpu_node" {
-  ami           = "ami-gpu-ubuntu-22" # placeholder for actual GPU AMI
+  ami           = var.gpu_ami_id
   instance_type = "g5.xlarge"       # NVIDIA A10G GPU instance
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/terraform/cloud/main.tf` around lines 6 - 7, The AMI value is a
placeholder and not a valid EC2 AMI ID, so replace the hardcoded ami =
"ami-gpu-ubuntu-22" with a valid AMI lookup or valid AMI ID; specifically either
set ami to a real AMI string matching ami-[0-9a-fA-F]{8,17} or use a data source
(e.g., data "aws_ami") to look up the latest Ubuntu GPU AMI and reference that
result where the current ami attribute is used alongside instance_type =
"g5.xlarge".

Comment thread platform/api/auth.py
Comment on lines +8 to +12
# In production, these would be in a database or vault
VALID_API_KEYS = {
os.getenv("GRID_ADMIN_KEY", "khulnasoft-admin-123"): "admin",
os.getenv("GRID_AGENT_KEY", "khulnasoft-agent-456"): "agent"
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded fallback API keys create security vulnerability.

If GRID_ADMIN_KEY or GRID_AGENT_KEY environment variables are not set, the system falls back to predictable hardcoded credentials ("khulnasoft-admin-123", "khulnasoft-agent-456"). An attacker could authenticate with these well-known values.

🔒 Proposed fix to require explicit configuration
-# In production, these would be in a database or vault
-VALID_API_KEYS = {
-    os.getenv("GRID_ADMIN_KEY", "khulnasoft-admin-123"): "admin",
-    os.getenv("GRID_AGENT_KEY", "khulnasoft-agent-456"): "agent"
-}
+# API keys must be explicitly configured via environment variables
+_admin_key = os.getenv("GRID_ADMIN_KEY")
+_agent_key = os.getenv("GRID_AGENT_KEY")
+
+if not _admin_key or not _agent_key:
+    raise RuntimeError("GRID_ADMIN_KEY and GRID_AGENT_KEY environment variables must be set")
+
+VALID_API_KEYS = {
+    _admin_key: "admin",
+    _agent_key: "agent"
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/auth.py` around lines 8 - 12, VALID_API_KEYS currently uses
hardcoded fallback values via os.getenv(..., "khulnasoft-...") which creates a
predictable backdoor; update the initialization so os.getenv is called without
default fallback for GRID_ADMIN_KEY and GRID_AGENT_KEY, validate that both are
present, and if either is missing raise a clear runtime/configuration error (or
abort startup) rather than populating VALID_API_KEYS with defaults; reference
the VALID_API_KEYS variable and the environment names GRID_ADMIN_KEY and
GRID_AGENT_KEY and ensure future secrets come from secure config/vault instead
of hardcoded strings.

Comment thread platform/api/db.py
Comment on lines +1 to +4
from sqlalchemy import create_all, Column, String, Integer, Float, DateTime, JSON, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship
from sqlalchemy import create_engine
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does SQLAlchemy expose create_allfromsqlalchemytop-level, or should table creation be done viaBase.metadata.create_all()?

💡 Result:

SQLAlchemy does not expose a top-level create_all function that can be imported directly from the sqlalchemy package (e.g., from sqlalchemy import create_all). Instead, table creation is performed via Base.metadata.create_all for ORM declarative models or metadata_obj.create_all for Core MetaData objects.

Citations:


🏁 Script executed:

# Check if there's a requirements file or setup file with SQLAlchemy version
find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "setup.cfg" -o -name "pyproject.toml" \) | head -20

Repository: CtxOS/ctxpod

Length of output: 87


🏁 Script executed:

# Read the db.py file to understand the context
cat -n platform/api/db.py | head -50

Repository: CtxOS/ctxpod

Length of output: 2161


🏁 Script executed:

# Search for usage of ForeignKey and relationship in the codebase
rg "ForeignKey|relationship" platform/api/ --max-count 20

Repository: CtxOS/ctxpod

Length of output: 241


🏁 Script executed:

# Check SQLAlchemy version
cat platform/api/requirements.txt

Repository: CtxOS/ctxpod

Length of output: 86


🏁 Script executed:

# Get full db.py to see if ForeignKey and relationship are actually used
wc -l platform/api/db.py

Repository: CtxOS/ctxpod

Length of output: 77


🏁 Script executed:

# Read the full db.py file
cat -n platform/api/db.py

Repository: CtxOS/ctxpod

Length of output: 2873


🏁 Script executed:

# Check if there's a pyproject.toml or other config that might specify SQLAlchemy version
cat pyproject.toml 2>/dev/null || cat setup.py 2>/dev/null || echo "No standard config found"

Repository: CtxOS/ctxpod

Length of output: 80


🌐 Web query:

SQLAlchemy declarative_base import location: sqlalchemy.ext.declarative vs sqlalchemy.orm in SQLAlchemy 1.4 and 2.0

💡 Result:

In SQLAlchemy 1.4, declarative_base is imported from sqlalchemy.orm: from sqlalchemy.orm import declarative_base. The legacy import from sqlalchemy.ext.declarative is deprecated since 1.4, as declarative functionality was integrated into sqlalchemy.orm, emitting deprecation warnings for ext.declarative usage. In SQLAlchemy 2.0, declarative_base is superseded by subclassing DeclarativeBase from sqlalchemy.orm: from sqlalchemy.orm import DeclarativeBase; class Base(DeclarativeBase): pass. The function-based declarative_base is legacy/deprecated, with DeclarativeBase providing better typing support without plugins. Example for 1.4: from sqlalchemy.orm import declarative_base Base = declarative_base Example for 2.0: from sqlalchemy.orm import DeclarativeBase class Base(DeclarativeBase): pass

Citations:


Fix SQLAlchemy imports; create_all cannot be imported from top-level SQLAlchemy.

The import from sqlalchemy import create_all is invalid and will fail at import time. Additionally, ForeignKey and relationship are imported but never used. The import of declarative_base should also be updated to use the current recommended location in sqlalchemy.orm (the import from sqlalchemy.ext.declarative is deprecated as of SQLAlchemy 1.4).

Proposed fix
-from sqlalchemy import create_all, Column, String, Integer, Float, DateTime, JSON, ForeignKey
-from sqlalchemy.ext.declarative import declarative_base
-from sqlalchemy.orm import sessionmaker, relationship
-from sqlalchemy import create_engine
+from sqlalchemy import Column, String, Integer, Float, DateTime, JSON, create_engine
+from sqlalchemy.orm import declarative_base, sessionmaker
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/db.py` around lines 1 - 4, Remove the invalid top-level import
of create_all and the unused ForeignKey and relationship imports, and update the
deprecated declarative_base import to the current location; specifically, in
platform/api/db.py stop importing create_all from sqlalchemy (use
Base.metadata.create_all(engine) instead), replace "from
sqlalchemy.ext.declarative import declarative_base" with "from sqlalchemy.orm
import declarative_base", and remove ForeignKey and relationship from the import
list so you only import Column, String, Integer, Float, DateTime, JSON (and keep
sessionmaker and create_engine) as needed.

Comment thread platform/api/main.py
Comment on lines +108 to +113
@app.get("/job/{job_id}")
def get_job_status(job_id: str):
job_data = r.get(f"job:{job_id}")
if not job_data:
raise HTTPException(status_code=404, detail="Job not found")
return json.loads(job_data)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Protect /job/{job_id} with auth and ownership checks.

This endpoint returns the full job payload, including params and user_scope, without authentication. Any caller who knows a job ID can read another tenant's job metadata.

🔐 Minimal authorization guard
 `@app.get`("/job/{job_id}")
-def get_job_status(job_id: str):
+def get_job_status(job_id: str, user_scope: str = Depends(get_api_key)):
     job_data = r.get(f"job:{job_id}")
     if not job_data:
         raise HTTPException(status_code=404, detail="Job not found")
-    return json.loads(job_data)
+    job = json.loads(job_data)
+    if job.get("user_scope") != user_scope:
+        raise HTTPException(status_code=403, detail="Forbidden")
+    return job
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/api/main.py` around lines 108 - 113, The get_job_status endpoint
currently returns full job payloads without authentication; require
authentication and enforce ownership/tenant checks before returning job_data:
add the existing auth dependency (e.g., get_current_user or your auth
middleware) to the get_job_status route, parse job_data (from r.get) into an
object, compare the job's owner/tenant field (e.g., job_obj["user_id"] or
job_obj["user_scope"]) against the authenticated user's id/scope, and if they
don't match raise HTTPException(401/403). Also consider removing or redacting
sensitive fields like "params" and "user_scope" from the returned payload unless
the authenticated user is authorized to see them. Ensure these checks live in
get_job_status so unauthorized callers cannot fetch another tenant's job.

Comment thread platform/scheduler/locks.py Outdated
Comment thread platform/scheduler/scheduler.py Outdated
Comment on lines +92 to +93
time.sleep(2) # Small gap between lock attempts
time.sleep(5)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix the stray indentation at Line 93.

The second time.sleep(5) is indented under no enclosing block, so Python raises IndentationError and the scheduler module cannot be imported.

🧹 Minimal fix
-        time.sleep(2) # Small gap between lock attempts
-            time.sleep(5)
+        time.sleep(2) # Small gap between lock attempts
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 93-93: Unexpected indentation

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/scheduler/scheduler.py` around lines 92 - 93, There is a stray
indentation on the second time.sleep(5) call which causes an IndentationError;
locate the two calls to time.sleep (the time.sleep(2) and the indented
time.sleep(5)) and unindent time.sleep(5) so it lines up with time.sleep(2) (or
remove the duplicate sleep if unintended), ensuring both calls are at the same
block level in the scheduler lock/retry logic.

Comment on lines +35 to +51
def get_active_nodes():
"""
Returns a list of all nodes that haven't timed out.
"""
all_nodes = r.hgetall("grid_nodes")
active_nodes = []

for node_id, data_str in all_nodes.items():
# Check if heartbeat key still exists
if r.exists(f"node_heartbeat:{node_id}"):
active_nodes.append(json.loads(data_str))
else:
# Cleanup stale node
logger.warning(f"Node {node_id} timed out. Removing from registry.")
r.hdel("grid_nodes", node_id)

return active_nodes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Race condition: concurrent get_active_nodes() calls can corrupt state.

This function is called from multiple concurrent processes (scheduler, heartbeat_monitor, autoscaler per context snippets). The pattern of iterating hgetall, checking exists, and deleting via hdel is not atomic—creating race conditions:

  1. TOCTOU: Between r.exists() and r.hdel(), another process may register or use the node
  2. Scheduler dispatch race: The scheduler may select a node that another concurrent call is about to delete
  3. Lost updates: autoscaler.py also directly calls hdel on the same keys
🔒 Consider using a Lua script for atomic cleanup or distributed locking
+CLEANUP_SCRIPT = """
+local all_nodes = redis.call('HGETALL', KEYS[1])
+local active = {}
+for i = 1, `#all_nodes`, 2 do
+    local node_id = all_nodes[i]
+    local data = all_nodes[i+1]
+    if redis.call('EXISTS', 'node_heartbeat:' .. node_id) == 1 then
+        table.insert(active, data)
+    else
+        redis.call('HDEL', KEYS[1], node_id)
+    end
+end
+return active
+"""
+
 def get_active_nodes():
-    all_nodes = r.hgetall("grid_nodes")
-    active_nodes = []
-    
-    for node_id, data_str in all_nodes.items():
-        if r.exists(f"node_heartbeat:{node_id}"):
-            active_nodes.append(json.loads(data_str))
-        else:
-            logger.warning(f"Node {node_id} timed out. Removing from registry.")
-            r.hdel("grid_nodes", node_id)
-            
-    return active_nodes
+    result = r.eval(CLEANUP_SCRIPT, 1, "grid_nodes")
+    return [json.loads(data) for data in result]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@platform/scheduler/state.py` around lines 35 - 51, get_active_nodes currently
does r.hgetall + per-entry r.exists + r.hdel which is a TOCTOU race; replace
that non-atomic pattern by moving the check-and-remove into a Redis Lua EVAL
call so the whole iteration, existence check of node_heartbeat:{node_id},
deletion from hash ("grid_nodes") and collection of active entries run
atomically; update the get_active_nodes function to call that Lua script
(instead of using r.hgetall / r.exists / r.hdel), returning the active node JSON
strings, and keep the same returned structure so callers of get_active_nodes
don't change.

Comment thread runtime/ray/worker_task.py Outdated
xeondesk and others added 4 commits April 26, 2026 01:54
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants